knitr::opts_chunk$set(comment =NA)library(janitor)library(naniar)library(broom)library(gt)library(gtsummary)library(mosaic)library(here) # introduced todaylibrary(conflicted) # introduced todaylibrary(tableone) # building Table Onelibrary(survey) # survey-specific tools for weightinglibrary(nhanesA) # data sourcelibrary(easystats)library(tidyverse)theme_set(theme_bw()) conflicts_prefer(dplyr::filter(), base::max(), base::sum(), base::mean())
Today’s Agenda
The conflicted package
The here package
Building Table One
Working with Survey Weights
What does the here package do?
The here package creates paths relative to the top-level directory. The package displays the top-level of the current project on load or any time you call here().
The goal of conflicted is to provide an alternative conflict resolution strategy. R’s default conflict resolution system gives precedence to the most recently loaded package. This can make it hard to detect conflicts, particularly when introduced by an update to an existing package. conflicted takes a different approach, making every conflict an error and forcing you to choose which function to use.
Conflict Resolution here
In the package setup today, I used the following code to resolve errors that came up in these slides:
The bradley.csv data set is simulated, but consists of 1,374 observations (687 Cases and 687 Controls) containing:
a subject identification code, in subject
status (case or control)
age (in years)
sex (Male or Female)
race/ethnicity (white or non-white)
married (1 = yes or 0 = no)
location (ICU, bed, other)
The bradley.csv data closely match the summary statistics provided in Table 1 of the Bradley et al. article. Our job is to recreate that part of Table 1, as best as we can.
The bradley.csv data (first 5 rows)
The bradley_sim.md file on our web site shows you how I simulated the data.
To “Live” Coding
On our web site (Data and Code + Class 12 materials)
In the data folder:
bradley.csv data file
bradley_table1.qmd Quarto script
bradley_table1.md Results of running Quarto
bradley_table1_result.csv is the table generated by that Quarto script
Weights for each sampled person in NHANES account for the complex survey design. The weight describes the number of people in the population represented by the sampled person, and is created in three steps:
the base weight is computed, which accounts for the unequal probabilities of selection given that some demographic groups were over-sampled;
adjustments are made for non-response; and
post-stratification adjustments are made to match estimates of the U.S. civilian non-institutionalized population available from the Census Bureau.
The DEMO file contains two kinds of sampling weights:
the interview weight (WTINT2yr), and
the MEC exam weight (WTMEC2yr)
NHANES also provides several weights for subsamples. In NHANES, we identify the variable of interest that was collected on the smallest number of respondents. The sample weight that applies to that variable is the appropriate one to use. In our first case, we will study total cholesterol and use the weights from the MEC exam.
What Variables Do We Need?
SEQN = subject identifying code
RIAGENDR = sex (1 = M, 2 = F)
RIDAGEYR = age (in years at screening, topcode at 80)
DMQMILIZ = served active duty in US Armed Forces (yes/no)
RIDSTATR = 2 if subject took both interview and MEC exam
WTMEC2YR - Full sample 2 year MEC exam weight
LBXTC = Total Cholesterol (mg/dl) - this is our outcome
The first 5 are in DEMO_H, and the first and last are in TCHOL_H.
Merge the DEMO and TCHOL files
dim(demo_raw)
[1] 10175 47
dim(tchol_raw)
[1] 8291 3
joined_df <-inner_join(demo_raw, tchol_raw, by =c("SEQN"))dim(joined_df)
[1] 8291 49
Create and save a small analytic tibble
nh1314 <- joined_df |># has n = 8291tibble() |>filter(complete.cases(LBXTC)) |># now n = 7624filter(RIDSTATR ==2) |># still 7624filter(RIDAGEYR >19& RIDAGEYR <40) |># now n = 1802filter(DMQMILIZ <3) |># drop 7 = refused, n = 1801mutate(FEMALE = RIAGENDR -1,AGE = RIDAGEYR,US_MIL =ifelse(DMQMILIZ ==1, 1, 0),WT_EX = WTMEC2YR,TOTCHOL = LBXTC) |>select(SEQN, FEMALE, AGE, TOTCHOL, US_MIL, WT_EX)write_rds(nh1314, here("c12/data/nh1314.Rds"))
The nh1314.Rds file is on our 432-data page if you need it.
nh1314 analytic sample
nh1314 |>select(AGE, TOTCHOL, WT_EX) |>summary()
AGE TOTCHOL WT_EX
Min. :20.00 Min. : 69 Min. : 8430
1st Qu.:24.00 1st Qu.:156 1st Qu.: 24694
Median :30.00 Median :178 Median : 34642
Mean :29.47 Mean :181 Mean : 44529
3rd Qu.:34.00 3rd Qu.:203 3rd Qu.: 59561
Max. :39.00 Max. :417 Max. :125680
nh_design <-svydesign(id =~ SEQN,weights =~ WT_EX,data = nh1314) nh_design <-update( nh_design, one =1) ## this one = 1 business will help us countnh_design
Independent Sampling design (with replacement)
update(nh_design, one = 1)
ggplot(res, aes(x = female, y = MEAN, col = type)) +geom_point(size =4) +facet_wrap(~ us_mil)
Building Models and Survey Weights
Modeling TOTCHOL in nh1314
First, we’ll ignore weighting, and fit a model with main effects of all three predictors (mod1), then a model (mod2) which incorporates an interaction of FEMALE and US_MIL.
mod1 <-lm(TOTCHOL ~ AGE + FEMALE + US_MIL, data = nh1314)mod2 <-lm(TOTCHOL ~ AGE + FEMALE * US_MIL, data = nh1314)
The interaction term means that the effect of FEMALE on TOTCHOL depends on the US_MIL status.
Estimate the percentage of the US non-institutionalized adult population within the ages of 30-59 who have smoked at least 100 cigarettes in their life that would describe their General Health as either “Excellent” or “Very Good”.
and you’ll do this without and then with sampling weights; due on Wednesday.
## By sex and ageourSummary(~ Depression, ~ Sex + Age_group, NH_des_2)
Sex Age_group counts Depression se
1 M 20-39 1654 5.513778 0.6461045
2 F 20-39 1674 10.050321 0.8036891
3 M 40-59 1556 5.222060 0.7699895
4 F 40-59 1751 11.477238 1.2011361
5 M 60+ 1611 6.052782 0.8295114
6 F 60+ 1696 9.579923 1.0534115
Compare Prevalence between Male and Female
Across all age groups:
svyttest(Depression ~ Sex, NH_des_2)
Design-based t-test
data: Depression ~ Sex
t = 6.8246, df = 29, p-value = 1.706e-07
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
3.416354 6.340267
sample estimates:
difference in mean
4.87831
Design-based t-test
data: Depression ~ Sex
t = 3.8688, df = 29, p-value = 0.0005706
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
2.948407 9.561949
sample estimates:
difference in mean
6.255178
Design-based t-test
data: Depression ~ Age_group
t = 0.79398, df = 29, p-value = 0.4337
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
-1.079836 2.450262
sample estimates:
difference in mean
0.6852129